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Abstract 



This work revisits existing algorithms for the QR factorization of rect- 
angular matrices composed of p x q tiles, where p > q. Within this frame- 
work, we study the critical paths and performance of algorithms such as 
Sameh-Kuck, Fibonacci, Greedy, and those found within PLASMA. 
Although neither Fibonacci nor Greedy is optimal, both are shown to 
be asymptotically optimal for all matrices of size p = q 2 f(q), where / is 
any function such that lim+oo / = 0. This novel and important complex- 
ity result applies to all matrices where p and q are proportional, p — Xq, 
with A > 1, thereby encompassing many important situations in practice 
(least squares) . We provide an extensive set of experiments that show the 
superiority of the new algorithms for tall matrices. 

1 Introduction 

Given an m-by-n matrix A with n < m, we consider the computation of its QR 
factorization, which is the factorization A = QR, where Q is an m-by-n unitary 
matrix (Q H Q = I n ), and R is upper triangular. 

The QR factorization is the time consuming stage of some important nu- 
merical computations. The QR factorization of an m-by-n matrix with n < m 
is needed for solving a linear least squares with m equations (observations) and 
n unknowns. The QR factorization of an m-by-n matrix with n < m is used 
to compute an orthogonal basis (the Q-factor) of the column span of the initial 
matrix A. For example, all block iterative methods (used to solve large sparse 
linear systems of equations or computing some relevant eigenvalues of such sys- 
tems) require orthogonalizing a set of vectors at each step of the process. For 
these two usage examples, while n < m, n can range from n <C m to n J$ m. 
We note that the extreme case n = m is also relevant: the QR factorization of 
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a matrix can be used to solve (square) linear systems of equations (in that case 
n = m). While this requires twice as many flops as an LU factorization, using a 
QR factorization (a) is unconditionally stable (Gaussian elimination with par- 
tial pivoting or pairwise pivoting is not) and (b) avoids pivoting so it may well 
be faster in some cases (despite requiring twice as many flops). 

To obtain a QR factorization, we consider algorithms which apply a sequence 
of m-by-m unitary transformations, Ui, (U^Ui = I,), i = 1, ...,£, on the 
left of the matrix A, such that after I transformations the resulting matrix 
R = U( ...U\A is upper triangular, in which case, R is indeed the R- factor 
of the QR factorization. The Q-factor (if needed) can then be obtained by 
computing Q = U^ . . . Uf . These types of algorithms are in regular use, e.g. in 
the LAPACK and ScaLAPACK libraries, and are favored over others algorithms 
(Cholcsky QR or Gram-Schmidt) for their stability. 

The unitary transformation Ui is chosen so as to introduce some zeros in the 
current update matrix Ui-i . . . U\A. The two basic transformations are Givens 
rotations and Householder reflections. One Givens rotation introduces one addi- 
tional zero; the whole triangularization requires mn-n(n+l)/2 Givens rotations 
for n < m. One elementary Householder reflection simultaneously introduces 
m — i zeros in position z + f to m in column i; the whole triangularization requires 
n Householder reflections for n < m. (See LAPACK subroutine GEQR2.) The 
LAPACK GEQRT subroutine constructs a compact WY representation to ap- 
ply a sequence of ib Householder reflections, this enables one to introduce the 
appropriate zeros in if, consecutive columns and thus leverage optimized Level 3 
BLAS subroutines during the update. The blocking of Givens rotations is also 
possible but is more costly in terms of flops. 

The main interest of Givens rotations over Householder transformations is 
that one can concurrently introduce zeros using disjoint pairs of rows, in other 
words, two transformations Ui and Ui+\ may be applicable concurrently This 
is not possible using the original Householder reflection algorithm since the 
transformations work on whole columns and thus does not exhibit this type of 
intrinsic parallelism forcing this kind of Householder reflections to be applied 
sequentially The advantage of Householder reflections over Givens rotations is 
that, first, Householder reflections perform less flops, and second, the compact 
WY transformation enables high sequential performance of the algorithm. In 
a multicore setting, where data locality and parallelism are crucial algorithmic 
characteristics for enabling performance, the tiled QR factorization algorithm 
combines both ideas: use of Householder reflections for high sequential perfor- 
mance and use of a scheme ala Givens rotations to enable parallelism within 
cores. In essence, one can think (i) cither of the tiled QR factorization as a 
Givens rotation scheme but on tiles (nib-hy-nt, submatriccs) instead of on scalars 
(I-by-I submatriccs) as in the original scheme, (ii) or of it as a blocked House- 
holder reflection scheme where each reflection is confined to an extent much less 
than the full column span, which enables concurrency with other reflections. 

Tiled QR factorization in the context of multicore architectures has been in- 
troduced in [5, 6, 15] . Initially the focus was on square matrices and the sequence 
of unitary transformations presented was analogous to Sameh-Kuck [16], which 
corresponds to reducing the panels with flat trees. The possibility of using any 
tree in order to either maximize parallelism or minimize communication is ex- 
plained in [10]. 

The focus of this manuscript is in maximizing parallelism. Stemming from 



2 



the observation that a binary tree is best for tall and skinny matrices and a 
flat tree is best for square matrices, Hadri et al. [12], propose to use trees which 
combine flat trees at the bottom level with a binary tree at the top level in order 
to exhibit more parallelism. Our theoretical and experimental work explains 
that we can adapt Fibonacci [14] and Greedy [7, 8] to tiles, resulting in yet 
better algorithms in terms of parallelism. Moreover our new algorithms do not 
have any tuning parameter such as the domain size in the case of [12]. 

The focus of this manuscript is not in trying to reduce communication (data 
movement between memory hierarchy) to a minimum. Relatively low level of 
communication is naturally achieved by the algorithm by tiling the operations. 
How to optimize the trade-off communication and parallelism is out of the scope 
of this manuscript. For this reason, we consider square tiling with constant tile 
size. In order to increase parallelism, we use so called TT kernels which are more 
parallel but performs potentially more communication and are less efficient in 
sequential than the TS kernels. (A longer discussion on the issue can be found 
in Section 2.1.) This is another trade-off that we made and we opted for as 
much parallelism as possible. 

We can quote three manuscripts who use some kind of rectangular tiling. 
Dcmmcl et al. [10] sequentially process rectangular tiles with a recursive QR fac- 
torization algorithm (which is communication optimal in sequential) and then 
uses reduction trees to perform the QR factorization in parallel. Experimental 
results are given using a binary tree on tall and skinny matrices. The same algo- 
rithms is used on the grid (grid of clusters) in [1]. The ScaLAPACK algorithm 
is used independently on each cluster on a large parallel distributed rectangular 
tile; then, a binary tree is used at the grid level among the clusters. Demmel 
et al. [9] use a binary tree on top of a flat tree for tall and skinny matrices. 
The binary tree is therefore used on rectangular tiles. The flat tree is used 
locally on the nodes to reduce sequential communication, while the binary tree 
is used within the nodes to increase parallelism. Finally, the approach of Hadri 
et al. [12] is not only interesting in term of parallelism to tackle various matrix 
shapes, it is also interesting in reducing communication (same approach in this 
case as in [9]) and enables the use of TS kernels. 

The sequential kernels of the Tiled QR factorization (executed on a core) 
are made of standard blocked algorithms ala LAPACK encoded in kernels; the 
development of these kernels is well understood. The focus of this manuscript is 
on improving the overall degree of parallelism of the algorithm. Given a p-hy-q 
tile matrix, we seek to find an appropriate sequence of unitary transformations 
on the tiled matrix so as to maximize parallelism (minimize critical path length). 
We will get our inspiration in previous work from the 70s/80s on Givcns rota- 
tions where the question was somewhat related: given an m-by-n matrix, find 
an appropriate sequence of Givens rotations as to maximize parallelism. This 
question is essentially answered in [7, 8, 14, 16]; we call this class of algorithms 
"coarse-grain algorithms." 

Working with tiles instead of scalars, we introduce four essential differences 
between the analysis and the reality of the tiled algorithms and the coarse-grain 
algorithms. First, while there are only two states for a scalar (nonzero or zero), 
a tile can be in three states (zero, triangle or full). Second, there are more 
operations available on tiles to introduce zeros; we have a total of three differ- 
ent tasks which can introduce zeros in a matrix. Third, the factorization and 
the update are dissociated to enable factorization stages to overlap with up- 
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date stages. In the coarse-grain algorithm, the factorization and the associated 
update are considered as a single stage. Fourth and last, while coarse-grain 
algorithms have only one task, we end up with six different tasks, which have 
different computational weights; this dramatically complicates the critical path 
analysis of the tiled algorithms. 

While the Greedy algorithm is optimal for "coarse-grain algorithms", we 
show that it is not in the case of tiled algorithms. We are unable to devise 
an optimal algorithm at this point, but we can prove that both Greedy and 
Fibonacci are asymptotically optimal for all matrices of size p = q 2 f{q), where 
/ is any function such that lim +00 / = 0. This result applies to all matrices 
where p and q are proportional, p = Xq, with A > 1, thereby encompassing 
many important situations in practice (least squares). 

This manuscript is organized as follows. Section 2 reviews the numerical 
kernels needed to perform a tiled QR factorization, and introduces elimination 
lists, which enable us to formally define tiled algorithms. Section 3 presents 
the core algorithmic contributions of this manuscript. One major result is the 
asymptotic optimality of two new tiled algorithms, Fibonacci and Greedy. 
Section 4 is devoted to numerical experiments on multicorc platforms. For 
tall matrices (p > 2q), these experiments confirm the superiority of the new 
algorithms over state-of-the-art solutions of the PLASMA library [5, 6, 10, 12]. 
Finally, we provide some concluding remarks in Section 5. 

2 The QR factorization algorithm 

Tiled algorithms are expressed in terms of tile operations rather than elementary 
operations. Each tile is of size n^xn^, where rib is a parameter tuned to squeeze 
the most out of arithmetic units and memory hierarchy. Typically, nj, ranges 
from 80 to 200 on state-of-the-art machines [3]. Algorithm 1 outlines a naive 
tiled QR algorithm, where loop indices represent tiles: 



Algorithm 1: Naive QR algorithm for a tiled p x q matrix. 

for k = 1 to min(p, q) do 
for i = k + 1 to p do 

elim(i,piv(i, k), k) 



In Algorithm 1, k is the panel index, and elim(i,piv(i, k), k) is an orthogonal 
transformation that combines rows i and piv(i, k) to zero out the tile in position 
(i, k). However, this formulation is somewhat misleading, as there is much more 
freedom for QR factorization algorithms than, say, for Cholesky algorithms (and 
contrarily to LU elimination algorithms, there are no numerical stability issues). 
For instance in column 1, the algorithm must eliminate all tiles (i, 1) where 
i > 1, but it can do so in several ways. Take p = 6. Algorithm 1 uses the 
transformations 

elim(2, 1, 1), elim(3, 1, 1), eZim(4, 1, 1), elim(5, 1, 1), elim(6, 1, 1) 

But the following scheme is also valid: 

elim(3, 1, 1), elim(6, 4, 1), eZim(2, 1, 1), elim(5, 4, 1), elim(4, 1, 1) 
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Operation 


Panel 


Update 




Name 


Cost 


Name 


Cost 


Factor square into triangle 


GEQRT 


4 


UNMQR 


6 


Zero square with triangle on top 


TSQRT 


6 


TSMQR 


12 


Zero triangle with triangle on top 


TTQRT 


2 


TTMQR 


6 



Table 1: Kernels for tiled QR. The unit of time is floating-point operations, 
where rib is the blocksize. 



In this latter scheme, the first two transformations elim(3, 1, 1) and elim(6, 4, 1) 
use distinct pairs of rows, and they can execute in parallel. On the contrary, 
elim(3, 1, 1) and elim(2, 1, 1) use the same pivot row and must be sequentialized. 
To complicate matters, it is possible to have two orthogonal transformations 
that execute in parallel but involve zeroing a tile in two different columns. For 
instance we can add elim(6, 5, 2) to the previous transformations and run it 
concurrently with, say, elim(2, 1, 1). Any tiled QR algorithm will be character- 
ized by an elimination list, which provides the ordered list of the transformations 
used to zero out all the tiles below the diagonal. This elimination list must obey 
certain conditions so that the factorization is valid. For instance, elim(6, 5, 2) 
must follow eZim(6,4, 1) and elim(5,4, 1) in the previous list, because there is a 
flow dependence between these transformations. Note that, although the elim- 
ination list is given as a totally ordered sequence, some transformations can 
execute in parallel, provided that they are not linked by a dependence: in the 
example, eZim(6,4, 1) and elim(2, 1, 1) could have been swapped, and the elim- 
ination list would still be valid. 

Before formally stating the conditions that guarantee the validity of (the 
elimination list of) an algorithm, we explain how orthogonal transformations 
can be implemented. 

2.1 Kernels 

To implement a given orthogonal transformation elim(i,piv(i,k),k), one can 
use six different kernels, whose costs are given in Table 1. In this table, the unit 

3 

of time is the time to perform floating-point operations. 

There are two main possibilities to implement an orthogonal transformation 
elim(i,piv(i, k), k): The first version eliminates tile (i, k) with the TS (Triangle 
on top of square) kernels, as shown in Algorithm 2: 



Algorithm 2: Elimination elim(i 7 piv(i,k),k) via TS (Triangle on top of 
square) kernels. 

GEQRT{piv{i,k),k) 
TSQRT(i,piv(i, fc), k) 
for j = k + 1 to q do 

UNMQR{piv(i,k),k,j) 

TSMQR(i,piv(i,k),k,j) 



Here the tile panel (piv(i, k), k) is factored into a triangle (with GEQRT). 
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The transformation is applied to subsequent tiles (piv(i,k),j), j > k, in row 
piv(i,k) (with UNMQR). Tile (i, k) is zeroed out (with TSQRT), and subse- 
quent tiles j > fc, in row i are updated (with TSMQR). The flop count 
is 4 + 6 + (6 + 12) (g — k) = 10 + 18 (g — k) (expressed in same time unit as in 
Table 1). Dependencies are the following: 

GEQRT(piv(i, fc), k) -< TSQRT{i,piv{i,k),k) 

GEQRT(piv(i, k), k) -< UNMQR(piv{i, k), k,j) for j > k 

UNMQR(piv(i, k), k,j) -< TSMQR(i,piv(i, k),k,j) for j > k 

TSQRT(i,piv(i, k), k) -< TSMQR(i,piv(i, k),k,j) for j > k 

Note that TSQRT{i,piv{i,k),k) and UNMQR(piv(i,k),k,j) can be executed 
in parallel, as well as UNMQR operations on different columns > k. With 
an unbounded number of processors, the parallel time is thus 4 + 6 + f 2 = 22 
time- units. 

The second approach to implement the orthogonal transformation 
elim(i,piv(i, k), k) is with the TT (Triangle on top of triangle) kernels, as shown 
in Algorithm 3: 



Algorithm 3: Elimination elim(i,piv(i,k),k) via TT (Triangle on top of 
triangle) kernels. 

GEQRT(piv(i,k),k) 

GEQRT(i, k) 

for j = k + 1 to q do 

UNMQR{piv(i,k),k,j) 

UNMQR(i,k,j) 

TTQRT(i,piv(i, fc), k) 
for j = k + 1 to q do 

TTMQR(i,piv(i,k),k,j) 



Here both tiles (piv(i,k),k) and («,fc) are factored into a triangle (with 
GEQRT). The corresponding transformations are applied to subsequent tiles 
(piv(i,k),j) and j > fc, in both rows piv(i,k) and i (with UNMQR). Tile 

{i,k) is zeroed out (with TTQRT), and subsequent tiles j > k, in row i 

are updated (with TTMQR). The flop count is 2(4 + 6(q - k)) + 2 + 6(q -k) = 
10 + 18(g — k), just as before. Dependencies are the following: 

GEQRT(piv(i, k),k) ~< UNMQR(piv(i, k),k, j) for j > k 

GEQRT(i, k) < UNMQR(i, k,j) for j > k 
GEQRT (piv(i,k),k) -< TTQRT(i,piv(i,k),k) 
GEQRT(i, k) -< TTQRT(i,piv{i,k),k) 

TTQRT(i,piv(i,k),k) -< TTMQR(i,piv(i,k),k,j) for j > k 

UNMQR(piv(i, k),k,j) -< TTMQR(i,piv(i, k),k,j) for j > k 

UNMQR(i, k,j) -< TTMQR(i,piv(i, k), k,j) for j > k 

Now the factor operations in row piv(i, k) and i can be executed in parallel. 
Moreover, the UNMQR updates can be run in parallel with the TTQRT fac- 
torization. Thus, with an unbounded number of processors, the parallel time is 
4 + 6 + 6 = 16 time-units. 
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In Algorithm 2 and 3, it is understood that if a tile is already in triangle 
form, then the associated GEQRT and update kernels are not applied. 

All the new algorithms introduced in this manuscript are based on TT (ker- 
nels. From an algorithmic perspective, TT kernels are more appealing than TS 
kernels, as they offer more parallelism. More precisely, we can always break a 
TS kernel into two TT kernels: We can replace a TSQRT(i,piv(i,k),k) (fol- 
lowing a GEQRT(piv(i, k),k)) by a GEQRT(i, k) and a TTQRT(i,piv(i, k),k). 
A similar transformation can be made for the updates. Hence a TO-based tiled 
algorithm can always be executed with TT kernels, while the converse is not 
true. However, the TS kernels provide more data locality, they benefit form a 
very efficient implementation (see Section 4), and several existing algorithms use 
these kernels. For all these reasons, and for comprehensiveness, our experiments 
will compare approaches based on both kernel types. 

Currently (April 2011), the PLASMA library only contains TS kernels. We 
have mapped the PLASMA algorithm to TT kernel algorithm using this conver- 
sion. Going from a TS kernel algorithm to a TT kernel algorithm is implicitly 
done by Hadri et al. [11] when going from their "Semi-Parallel" to their "Fully- 
Parallel" algorithms. 

2.2 Elimination lists 

As stated above, any algorithm factorizing a tiled matrix of size p x q is charac- 
terized by its elimination list. Obviously, the algorithm must zero out all tiles 
below the diagonal: for each tile (i,k), i > fc, 1 < k < min(p, q), the list must 
contain exactly one entry elim(i,-k, k), where * denotes some row index piv(i, k) 
. There are two conditions for a transformation elim(i,piv(i, k), k) to be valid: 

• both rows i and piv{i, k) must be ready, meaning that all their tiles left 
of the panel (of indices (i,k') and (piv(i, k), k') for 1 < k' < k) must 
have already been zeroed out: all transformations elim(i,piv(i, k'), k') and 
elim(piv(i,k),piv(piv(i,k),k'),k') must precede elim(i,piv(i,k),k) in the 
elimination list 

• vow piv(i, k) must be a potential annihilator, meaning that tile {piv{i, k), k) 
has not been zeroed out yet: 

the transformation elim(piv(i^k),piv(piv(i,k),k),k) must follow 
elim(i,piv(i, k), k) in the elimination list 

Any algorithm that factorizes the tiled matrix obeying these conditions is called 
a generic tiled algorithm in the following. 

A critical result is that no matter what elimination list is used the total 
weight of the tasks for performing a tiled QR factorization algorithm is constant 
and equal to 6pq 2 — 2q 3 . Using our unit task weight of n^/3, with m = put, and 
n = qrib, we obtain 2mn 2 — 2/3n 3 flops which is the exact same number as for a 
standard Householder reflection algorithm as found in LAPACK (e.g., [4]). We 
note that this results is true if (a) we use TS kernels as well and if (b) we use 
any tiling, (e.g. rectangular tiles). 

2.3 Execution schemes 

In essence, the execution of a generic tiled algorithm is fully determined by 
its elimination list. This list is statically given as input to the scheduler, and 
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the execution progresses dynamically, with the scheduler executing all required 
transformations as soon as possible. More precisely, each transformation in- 
volves several kernels, whose execution starts as soon as they are ready, i.e., as 
soon as all dependencies have been enforced. Recall that a tile (i, fc) can be ze- 
roed out only after all tiles {i, fc'), with k' < k, have been zeroed out. Execution 
progresses as follows: 

• Before being ready for elimination, tile (i, k), i > k, must be updated fc — 1 
times, in order to zero out the fc — 1 tiles to its left (of index (i, k'), fc' < fc). 
The last update is a transformation TTMQR(i,piv(i, k — l),fc — 1, fc) for 
some row index piv(i, k — 1) such that elim(i,piv(i, k — 1), fc — 1) belongs 
to the elimination list. When completed, this transformation triggers the 
transformation GEQRT(i,k), which can be executed immediately after 
the completion of the TTMQR. In turn, GEQRT(i, fc) triggers all updates 
UNMQR(i, fc, j) for all j > fc. These updates are executed as soon as they 
are ready for execution. 

• The elimination elim(i,piv(i, fc), fc) is performed as soon as possible when 
both rows i and piv(i, fc) are ready. Just after the completion of 
GEQRT(i, fc) and GEQRT(piv(i, fc), fc), kernel TTQRT(i, piv(i,k),k) is 
launched. When finished, it triggers the updates TTMQR(i,piv(i, fc), fc, j) 
for all j > k. 

Obviously, the degree of parallelism that can be achieved depends upon 
the eliminations that are chosen. For instance, if all eliminations in a given 
column use the same factor tile, they will be sequentialized. This corresponds 
to the flat tree elimination scheme described below: in each column fc, it uses 
elim{i, fc, fc) for all i > fc. On the contrary, two eliminations elim(i,piv(i, fc), fc) 
and elim(i',piv(i',k),k) in the same column can be fully parallelized provided 
that they involve four different rows. Finally, note that several eliminations 
can be initiated in different columns simultaneously, provided that they involve 
different pairs of rows, and that all these rows are ready (i.e., they have the 
desired number of leftmost zeros). 

The following lemma will prove very useful; it states that we can assume 
w.l.o.g. that each tile is zeroed out by a tile above it, closer to the diagonal. 

Lemma 1. Any generic tiled algorithm can be modified, without changing its ex- 
ecution time, so that all eliminations elim(i,piv(i,k),k) satisfy to i > piv(i,k). 

Proof. Define a reverse elimination as an elimination elim(i,piv{i,k),k) where 
i < piv(i, fc). Consider a generic tiled algorithm whose elimination list contains 
some reverse eliminations. Let fco be the first column to contain one of them. 
Let io be the largest row index involved in a reverse elimination in column 
fco. The elimination list in column fco may contain several reverse eliminations 
elim(ii, io, fco), elim(i2,io,ko), elim(i r ,io,ko), in that order, before row io 
is eventually zeroed out by the transformation elim(io, piv(io,ko),ko). Note 
that piv(io, fco) < io by definition of io- We modify the algorithm by exchanging 
the roles of rows io and i\ in column fco: the elimination list now includes 
elim(io, «i, fco), elim(i2,i\,ko), elim(i r , i\, fco), and 

elim(ii,piv(io-, fco), fco)- All dependencies are preserved, and the execution time 
is unchanged. Now the largest row index involved in a reverse elimination in 
column fco is strictly smaller than io, and we repeat the procedure until there 
does not remain any reverse elimination in column fco. We proceed inductively 
to the following columns, until all reverse eliminations have been suppressed. □ 
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3 Critical paths 



In this section we describe several generic tiled algorithms, and we provide their 
critical paths, as well as optimality results. These algorithms are inspired by 
algorithms that have been introduced twenty to thirty years ago [16, 14, 8, 7], 
albeit for a much simpler, coarse-grain model. In this "old" model, the time-unit 
is the time needed to execute an orthogonal transformation across two matrix 
rows, regardless of the position of the zero to be created, hence regardless of the 
length of these rows. Although the granularity is much coarser in this model, 
any existing algorithm for the old model can be transformed into a generic 
tiled algorithm, just by enforcing the very same elimination list provided by the 
algorithm. Critical paths are obtained using a discrete event based simulator 
specially developed to this end, based on the Simgrid framework [17] . It carefully 
handles dependencies across tiles, and allows for the analysis of both static and 
dynamic algorithms. 1 

3.1 Coarse-grain algorithms 

We start with a short description of three algorithms for the coarse-grain model. 
These algorithms are illustrated in Table 2 for a 15 x 6 matrix. 

Sameh-Kuck algorithm The Sameh-Kuck algorithm [16] uses the panel 
row for all eliminations in each column, starting from below the diagonal and 
proceeding downwards. Time-steps indicate the time-unit at which the elimi- 
nation can be done, assuming unbounded resources. Formally, the elimination 
list is 



Fibonacci algorithm The Fibonacci algorithm is the Fibonacci scheme of 
order 1 in [14]. Let coarse(i, k) be the time-step at which tile (i,k), i > k, is 
zeroed out. These values are computed as follows. In the first column, there 
are one 5, two 4's, three 3's, four 2's and four l's (we would have had five l's 
with p = 16). Given x as the least integer such that x(x + l)/2 > p — 1, we have 
coarse(i, 1) = x — y + 1 where y is the least integer such that i < y(y + l)/2 + 1. 
Let the row indices of the z tiles that are zeroed out at step s, 1 < s < x, range 
from i to i+z — 1. The elimination list for these tiles is elim(i+j,piv(i+j, 1), 1), 
with piv(i + j) = i + j — z for < j < z — 1. In other words, to eliminate a 
bunch of z consecutive tiles at the same time-step, the algorithm uses the z rows 
above them, pairing them in the natural order. Now the elimination scheme of 
the next column is the same as that of the previous column, shifted down by 
one row, and adding two time-units: coarse{i, k) = coarse(i — 1, k — 1) + 2, while 
the pairing obeys the same rule. 

Greedy algorithm At each step, the Greedy algorithm [8, 7] eliminates as 
many tiles as possible in each column, starting with bottom rows. The pairing 
for the eliminations is done exactly as for Fibonacci. There is no closed-form 

The discrete event based simulator, together with the code for all tiled algorithms, is 
publicly available at http://graal.ens-lyon.fr/-mjacquel/tiledQR.html 




elim(i, k, k), i = k + 1, k + 2, . . . , p I, k = 1, 2, . . . , min(p, q) 
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formula to compute coarse(i, k), the time-step at which tile (i, k) is eliminated, 
but it is possible to provide recursive expressions (see [8, 7]). 



(a) Sameh-Kuck 
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(c) Greedy 
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Table 2: Time-steps for coarse-grain algorithms. 



Consider a rectangular p x q matrix, with p > q. With the coarse-grain 
model, the critical path of Sameh-Kuck is p + q — 2, and that of Fibonacci 
is x + 2q — 2, where x is the least integer such that x{x + l)/2 > p — 1. The 
critical path of Greedy is unknown, but two important results are known: 
(i) the critical path of Greedy is optimal; (ii) its value tends to 2q if p is 
negligible in front of q 2 , i.e., if we have p = q 2 f(q) where / is any function such 
that lim +00 / = (and f(q) > 1/q so that p > q). In particular, let p and q 
be proportional, p = Xq, with a constant A > 1: Fibonacci is asymptotically 
optimal, because x is of the order of ^/q, hence its critical path is 2q + o{q). On 
the contrary, Sameh-Kuck is not asymptotically optimal since its critical path 
is (1 + \)q — 2. For square q x q matrices, critical paths are slightly different 
(2q - 3 for Sameh-Kuck, x + 2q - 4 for Fibonacci), but the important result 
is that all three algorithms are asymptotically optimal in that case. 

3.2 Tiled algorithms 

As stated above, each coarse-grain algorithm can be transformed into a tiled 
algorithm, simply by keeping the same elimination list, and triggering the exe- 
cution of each kernel as soon as possible. However, because the weights of the 
factor and update kernels are not the same, it is much more difficult to compute 
the critical paths of the transformed (tiled) algorithms. Table 3 is the counter- 
part of Table 2, and depicts the time-steps at which tiles are actually zeroed 
out. Note that the tiled version of Sameh-Kuck is indeed the FlatTree algo- 
rithm in PLASMA [5, 6], and we have renamed it accordingly. As an example, 
Algorithm 4 shows the Greedy algorithm for the tiled model. 

A first (and quite unexpected) result is that Greedy is no longer optimal, 
as shown in the first two columns of Table 4a for a 15 x 2 matrix. In each 
column and at each step, "the Asap algorithm" starts the elimination of a tile 
as soon as there are at least two rows ready for the transformation. When 
s > 2 eliminations can start simultaneously, Asap pairs the 2s rows just as 
Fibonacci and Greedy, the first row (closest to the diagonal) with row s + 1, 
the second row with row s + 2, and so on. As a matter of a fact, when processing 
the second column, both Asap and Greedy begin with the elimination of lines 
10 to 15 (at time step 20). However, once tiles (13,2), (14,2) and (15,2) are 
zeroed out (i.e. at time step 22), Asap eliminates 4 zeros, in rows 9 through 
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Algorithm 4: Greedy algorithm via TT kernels. 



for j = l to q do 

/* nz(j) is the number of tiles which have been eliminated in column j */ 

nZ(j) = 

/* nT(j) is the number of tiles which have been triangularized in column j */ 

nT(j) = 

while column q is not finished do 
for j — q down to 1 do 
if j 1 then 

/* Triangularize the first column if not yet done */ 

"T ncw = nT(j) + (p - nT(j)) 
if p - nT(j) > then 

for k — p down to 1 do 
GEQRT(kJ) 
for jj = j -f- 1 to q do 
L UNMQR(k,j,jj) 



else 

/* Triangularize every tile having a zero in the previous column */ 

nT now = nZ(j - 1) 

for k = nT(j) to nT ncw - 1 do 

GEQRT(p- kj) 

for j j = j + 1 to q do 
L UNMQR(p-k,j,jj) 

/* Eliminate every tile triangularized in the previous step */ 

nT(j) - nZij) 
nZ nc „ = nZ(j) + L ^ 

for kk — nZ(j) to nZ ncw — 1 do 

piv(p — kk) — p — kk — nZ new + nZ(j) 
TTQRT(p - kk,piv(p - kk),j) 
for jj — j : + 1 to q do 
|_ TTMQRip - kk, piv{p - kk)J. jj) 

/* Update the number of triangularized and eliminated tiles at the next step 
*/ 

nTQ) - nT ncw 
nZ(j) — nZnew 



12. On the contrary, Greedy waits until time step 26 to eliminate 6 zeros in 
rows 6 through 12. In a sense, Asap is the counterpart of Greedy at the tile 
level. However, Asap is not optimal either, as shown in Table 4a for a 15 x 3 
matrix. On larger examples, the critical path of Greedy is better than that of 
Asap, as shown in Table 4b. 

We have seen that, for a 15 x 2 matrix, Asap is better than Greedy and that, 
for a 15 x 3 matrix, Greedy is better than Asap. We can further improve upon 
Greedy in the 15 x 3 case. We consider the GRASAP(fc) algorithm defined as: 
following the Greedy algorithm up from columns 1 to q — k and then switching 
in Asap mode for the last k columns. Grasap(O) is Greedy, while Grasap(q) 
is Asap. In Table 4a(c), we give the results for Grasap(I). In this case (a 15 x 3 
matrix), Grasap(I) is better than Greedy. Grasap(I) finishes at time-step 
62, while Greedy finishes at time-step 64. Of course it would be interesting 
to determine the best value of k as a function of p and q, for the execution of 
GRASAP(fc) on a p x q matrix. 

Wc have a closed-form formula for the critical path of tiled FlatTree, but 
not for that of tiled Fibonacci (contrarily to the coarse-grain case). But we 
provide an asymptotic expression, both for Fibonacci and for Greedy. More 
importantly, wc show that both tiled algorithms arc asymptotically optimal. 
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Table 3: Time-steps for tiled algorithms. 



(a) GREEDY nor ASAP arc optimal. 



(b) GREEDY generally outperforms ASAP. 
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Table 4: Neither Greedy nor Asap are optimal. 



We state our main result: 

Theorem 1. For a tiled matrix of size p x q, where p > q: 

1. The critical path length of FlatTree is 

2p + 2 ifp>q=l 
6p+l6q-22 ifp>q>l 
22p-24 ifp = q>l 

2. The critical path length of FIBONACCI is at most 22q + 6\y / 2p \ , and that 
of Greedy is at most 22q + 6 |~log 2 p] 

3. The optimal critical path length is at least 22q — 30 

4- Fibonacci is asymptotically optimal if p = q 2 f(q), where lim +oc / = 
5. Greedy is asymptotically optimal iflog 2 p = qf(q), where lim +oc / = 

Proof. Proof of (1). Consider first the case p > q = 1. We shall proceed by 
induction on p to show that the critical path of FlatTree is of length 2p + 2, 
If p = 1, then from Table 1 the result is obtained since only GEQRT(1, 1) 
is required. With the base case established, now assume that this holds for all 
p— 1 > q = 1. Thus at time t = 2(p— 1)+2 = 2p, we have that for allp— 1 > i > 1 
tile (i, 1) has been factorized into a triangle and for all p — 1 > i > 1, tile (i, 1) 
has been zeroed out. Therefore, tile (p, 1) will be zeroed out with TTQRT(p, 1) 
at time t + 2 = 2(p - 1) + 2 + 2 = 2p + 2. 

Consider now the case p > q > 1. We show by induction on k that tile 
for i > k > 2, is zeroed out in FlatTree at time unit 6i + 16k — 22. 
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For k = 2, tile (2, 2) is updated from step k = 1 at time 4 + 6 + 6 = 16, 
and it is factored into a triangle at time 20. Tile (3, 2) is updated from step 
k = 1 at time 22 factored into a triangle at time 26 and then zeroed out at time 
26 + 2 = 28 = 6x3 + 16x2 — 22. A new tile in column 2 is zeroed out every 6 
time units, hence the initialization of the induction for k = 2. Assume now that 
the formula holds up to column k, and let t = 6(fc + 1) + 16fc — 22 be the time 
at which tile (k + 1, k) is zeroed out. Tile (k + 1, k + 1) is updated from step k 
at time t— 2 + 6 + 6 = i+10 and factored into a triangle at time t + 14. By 
induction, tile (k + 2, k) is zeroed out at time t + 6, hence triangularized at time 
i+4. The corresponding UNMQR update of tile (fc + 2, fc + 1) ends at time < + 10, 
its TTMQR update ends at time max(t + 14, t + 10) + 6 = t + 20. Hence tile 
(fc + 2, k + 1) can indeed be zeroed out at time max(t + 12, t + 20) + 2 = t + 22. 
A new tile in column fc + 1 can be zeroed out every 6 time units, hence the 
induction formula for k + 1 . 

Finally, for a square matrix of size q x q, consider the above formula for a 
rectangular matrix with p = q + 1. Instead of zeroing out the last tile (q + 1, q) 
with tile (q, g), simply need to factor tile (<?, q) into a triangle with GEQRT(q, q). 
This costs 4 time units instead of 6 when adding TTQRT(q + l,q, q), and 
explains the difference of 2 in the formula for square matrices. 

Proof of (2). Fibonacci and Greedy are more difficult to analyze than 
FlatTree, but we provide an upper bound of their critical path. The approach 
is the same for both algorithms, and hereafter Alg denotes either Fibonacci 
or Greedy. Let coarse(i,k) be the time-step at which tile (i,k) is zeroed out 
in Alg with the coarse-grain model (see Table 2 for examples). We derive a 
"slowed down" version of the tiled version of Alg by terminating the zeroing 
out of tile (i, k) at time-step 

6coarse(2, 1) + 22(fc — 1) — 6(coarse(k, k) — coarse(i, k)). 

We say that this version is slowed down because we do not start the zeroing 
out of the tiles as soon as possible. For instance in the first column, tile (i, 1) is 
zeroed out at time 6coarse(i, fc), which is larger than the value given in Table 3. 
However, we keep the same elimination list as in the original version of Alg, 
and we trigger the update and factor operations as soon as possible when the 
zeroing out operation is completed. We only delay these latter operations. 

The intuitive idea for delaying the eliminations is that the corresponding up- 
dates will be fully overlapped, within a given column, or when proceeding from 
one column to the next: in this case, allowing for a time-shift of 22 smooths the 
chaining of the updates. The regular and repetitive spacing of the eliminations 
allows us to check (just as we did to prove (1)) that all dependencies are en- 
forced in the slowed down version of Alg. Because the case-analysis is tedious, 
we have written a program for a sanity check of the validity of Alg 2 . 

In the coarse-grain model, Alg terminates the first column in time x, so the 
critical path of its slowed down version is 6x + 22 (q — 1). For Fibonacci, x is 
the least integer such that x(x + l)/2> p — 1, hence x < \\f2p\ . For Greedy, 
x = [log 2 (p — 1)] < [log 2 p] , hence the result. 

Proof of (3). Consider a square 5x5 matrix B, with q > 2. Assume 
that there are only three non-zero sub-diagonals, i.e., that tile (i,k) is initially 

All program sources are publicly available at http://graal.ens-lyon.fr/~mjacquel/ 
tiledQR.html 
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zero in B for i > k + 3. Because there are only three non-zero tiles below the 
diagonal, there is a constant number of possible row pairings in each column. 
An exhaustive search is to try all possible pairings in the first column, followed 
by all possible pairings in the second column, and so on. After a few columns, 
a pattern emerges, and we can identify that any optimal algorithm (there are 
several of them) needs at least 22 time-steps to proceed from one column to the 
next. It is possible to save a few steps at the beginning and end of the execution, 
and the optimal critical path is 22g — 30. Here also, because the case-analysis 
is long and tedious, we have written a program for a sanity check of the latter 
value. 

Now we show that the optimal critical path for a general p x q matrix A, 
with p > q > 2, is at least equal to the critical path of the previous q x q matrix 
B with three sub-diagonals. Indeed, Lemma 1 shows that there exist optimal 
algorithms for factoring A without any reverse elimination. Consider such an 
algorithm, and discard all eliminations that involve zeroing out elements below 
the third sub-diagonal, or outside the q x q top square: the critical path cannot 
increase, and we have an elimination scheme for B, which proves the desired 
result. 

Note that using B instead of A is the key to the proof: in each column 
of B, there is only a constant number of possible row pairings, which makes 
it possible to try all combinations for several consecutive columns. Reasoning 
with A instead would need a completely different proof (yet to be invented). 

Proof of (4) and (5). These are a direct consequence of (3) and (4). □ 

Remarks: 

1. We express all critical path lengths in terms of p and q, with an unit of 
n^/3 floating-point operations. It is easy to get critical path lengths in 
term of m, n, and rib, and with elementary floating-point operations as 
unit, assuming that all tiles are full. (In other words, m and n are multiple 
of rib.) For example for FlatTree, we get (2/3)mnj + (2/3)n^ if m > n = 
n b , 2mnl + 16/3nng - (22/3)ng if m > n > n b and (22/3)nng - (24/3)ng 
if m = n > rib- 

2. From this formula, it is clearer that, if one wants to minimize the number 
of floating-point operations on the critical path, one needs to take nj, = 1. 
However, such an action would have disastrous consequences. The com- 
munication increase would be way too high, and the increase gain in paral- 
lelism would not be worth the overhead. More importantly, the efficiency 
of the elimination kernels would be much lower. In this manuscript, we 
consider rib constant, large enough so that elimination kernels operate at 
full Level 3 BLAS performance, and so that communication costs remain 
relatively low. 

3. In the square case, we see that the critical path length of the tiled algo- 
rithms is typically in 0(nnl). This is in sharp contrast with the current 
LAPACK algorithm GEQRF. If we assume that the panel is not paral- 
lclizable, and that the block size for the LAPACK algorithm is rib, then 
counting the length of the chain of panel factorization steps leads to a 
critical path length in 0(n 2 rib). There is therefore much more parallelism 
to exploit in the tiled algorithms than in the current LAPACK algorithms. 
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Or, stated differently [5, 6], the granularity of the tiled algorithms is finer 
than that of the LAPACK algorithm. 

In Table 3 we also report time-steps for the BinaryTree algorithm. As its 
name indicates, this algorithm performs a binary tree reduction to zero out tiles 
in each column. Here is an asymptotic expression of its critical path: 

Proposition 1. Consider a tiled matrix of size px q 7 where p > q. The critical 
path length of BinaryTree is 6q log 2 p + o(q log 2 p) . 

Proof. It is possible to derive an exact expression for the critical path length of 
BinaryTree in the special case where p and q are both exact powers of two, 
with q < p. We obtain the value (10 + 61og 2 p)g — 41og 2 p — 6. As before, the 
proof goes by (tedious) induction. Here again, we have written a program for 
a sanity check of the latter value. The asymptotic value follows easily for an 
arbitrary matrix, by enlarging each dimension to the nearest power of two. □ 

Proposition 1 shows that BinaryTree is not asymptotically optimal. The 
PLASMA library provides more algorithms, that can be informally described as 
trade-offs between FlatTree and BinaryTree. (We remind the reader that 
FlatTree is the same as algorithm as Sameh-Kuck.) These algorithms are 
referred to as PlasmaTree in all the following, and differ by the value of an 
input parameter called the domain size BS . This domain size can be any value 
between 1 andp, inclusive. Within a domain, that includes BS consecutive rows, 
the algorithm works just as FlatTree: the first row of each domain acts as a 
local panel and is used to zero out the tiles in all the other rows of the domain. 
Then the domains are merged: the panel rows are zeroed out by a binary tree 
reduction, just as in BinaryTree. As the algorithm progresses through the 
columns, the domain on the very bottom is reduced accordingly, until such time 
that there is one less domain. For the case that BS = 1, PlasmaTree follows 
a binary tree on the entire column, and for BS = p, the algorithm executes a 
flat tree on the entire column. It seems very difficult for a user to select the 
domain size BS leading to best performance, but it is known that BS should 
increase as q increases. Table 3 shows the time-steps of PlasmaTree with a 
domain size of BS = 5. In the experiments of Section 4, we use all possible 
values of BS and retain the one leading to the best value. 

So far our study has only been concerned with algorithms based on TT 
kernels. Indeed, in the manuscript, FlatTree stands for TT-FlatTree. We 
now give the critical path of the algorithm TS-FlatTree. This corresponds to 
the FlatTree algorithm (i.e., Sameh-Kuck) with TS kernels. This algorithm 
was introduced in [5, 6, 15] and is available in PLASMA for performing the QR 
factorization of a matrix on multicore architecture. 

Proposition 2. The critical path length for TS-FlatTree is 

6p — 2 for p > q = 1 
12p+18q-32 forp>q>l 
30p - 34 forp = q>\ 

Proof. Consider the case of p > q = 1. In order to show that for any p, with 
q = 1, the critical path is of length 6p — 2, we shall proceed by induction on p. 
If p = q = 1, then from Table 1 the result is obtained since only GEQRT(1, 1) 
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is required. With the base case established, now assume that this holds for all 
p — 1 > q = 1. Thus at time t = 6(p — 1) — 2 = 6p — 8, we have that tile (1, 1) 
has been factorized into a triangle and for all p — 1 > i > 1, tile (i, 1) has been 
zeroed out. Therefore, tile (p, 1) will be zeroed out with TSQRT(p, 1) at time 
t + 6 = 6(p - 1) - 2 + 6 = 6p - 2. 

Assume that p > g > 1. We show by induction on k that tile (£,&), for 
i > k > 2, is zeroed out at time unit 12i + 18k — 32. Tile (2, 2) is updated from 
step k = 1 at time 6(2) — 2 + 12 = 22, it is factored into a triangle at time 28. 
Tile (3, 2) is zeroed out at time 28 + 12 = 40 = 12 x 3 + 18 x 2 - 32, and a new 
tile in column 2 is zeroed out every 12 time units, hence the initialization of the 
induction for k = 2. 

Assume now that the formula holds up to column k, and let t = 12(fc + 1) + 
18k — 32 be the time at which tile (k + l,k) is zeroed out. Tile (k + 1, k + 1) is 
updated from step k at time t + 12 and factored into a triangle at time t + 18. 
By induction, tile (k + 2, k) is zeroed out at time t + 12. Hence tile (k + 2, k + 1) 
can indeed be zeroed out at time max(i + 12, t + 18) + 12 = t + 30. A new 
tile in column k + 1 can be zeroed out every 12 time units, hence the induction 
formula for k + 1 . 

For a square matrix of size q x q, consider the above formula for a rectangular 
matrix with p = q + 1. Instead of zeroing out the last tile (q + l,q) with tile 
(q, q) in 6 time units with TSQRT(q + 1, q), we simply need to factor tile (q, q) 
into a triangle with GEQRT(q, q). This costs 4 time units instead of 6, and 
explains the difference of 2 in the formula for square matrices. □ 

As we can see, the critical path of TS-FlatTree (Proposition 2) is longer 
than the one of FlatTree (Theorem 1(1)). This stems from the facts that (1) 
a TS algorithm can be converted into a TT algorithm, and (2) this conversion 
increases the parallelism, and, consequently, reduces the critical path length. 

4 Experimental results 

All experiments were performed on a 48-core machine composed of eight hexa- 
core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 
GHz. Each core has a theoretical peak of 11.2 Gflop/s with a peak of 537.6 
Gflop/s for the whole machine. The Istanbul micro-architecture is a NUMA 
architecture where each socket has 6 MB of level-3 cache and each processor has 
a 512 KB level-2 cache and a 128 KB level-1 cache. After having benchmarked 
the AMD ACML and Intel MKL BLAS libraries, we selected MKL (10.2) since 
it appeared to be slightly faster in our experimental context. Linux 2.6.32 and 
Intel Compilers 11.1 were also used in conjunction with PLASMA 2.3.1. 

For all results, we show both double and double complex precision, using all 
48 cores of the machine. The matrices are of size m = 8000 and 200 < n < 8000. 
The tile size is kept constant at rib = 200, so that the matrices can also be viewed 
as p x q tiled matrices where p = 40 and 1 < q < 40. All kernels use an inner 
blocking parameter of if, = 32. 

Asymptotically all operations in a QR factorization are FMAs ( 11 fused multiply- 
add", y <— ax + y). In real arithmetic, an FMA involves three double precision 
numbers for two flops, but these two flops can be combined into one FMA in- 
struction and thus completed in one cycle. In complex arithmetic, the operation 



16 



y <— ax + y involves six double precision numbers for eight flops; we also note 
that there is no such thing as a complex-arithmetic FMA. The ratio of com- 
putation/communication is therefore, potentially, four times higher in complex 
arithmetic than in real arithmetic. Communication aware algorithms are much 
more critical in real arithmetic than in complex arithmetic. This is the reason 
why we present results in complex arithmetic and in real arithmetic. Our new 
algorithms will be at their best in the complex arithmetic case where paral- 
lelism is most important while communication less. In the real arithmetic case, 
we will see that TS kernels which perform potentially less communication than 
TT kernels have the advantage as soon as there is enough parallelism from the 
algorithm (q large enough). 

The PLASMA interface allows one to specify the dependencies between tasks 
by designating the data as either INPUT, OUTPUT, INOUT, or NODEP. Cur- 
rently, the update kernels ( UNMQR, TTMQR, and TSMQR) introduced false 
dependencies between the tasks which sequentializes the execution of update 
with the factorization kernels TTQRT or TSQRT. In order to alleviate these, 
we altered the dependency designation within each of the update kernels for 
the matrix of Householder reflectors, V, from INPUT to NODEP as is further 
explained in [13]. The dependencies between the tasks are still consistent since 
the T matrix within each update kernel continues to be designated as INPUT 
so that any subsequent task which overwrites this T matrix cannot be executed. 

For each experiment, we provide a comparison of the theoretical performance 
to the actual performance. The theoretical performance is obtained by model- 
ing the limiting factor of the execution time as either the critical path, or the 
sequential time divided by the number of processors. This is similar in approach 
to the Roofline model [19]. Taking 7 seq as the sequential performance, T as the 
total number of flops, cp as the length of the critical path, and P as the number 
of processors, the predicted performance, 7 pre d, is 



Figures la and lc depict the predicted performance of all algorithms which use 
the Triangle on top of triangle kernels. For double complex precision, sequential 
kernels reach 3.1860 GFlop/s while in double precision, the peak performance is 
3.8440 GFlop/s. Since PlasmaTree provides an additional tuning parameter 
of the domain size, we show the results for each value of this parameter as well 
as the composition of the best of these domain sizes. Again, it is not evident 
what the domain size should be for the best performance, hence our exhaustive 
search. 

Part of our comprehensive study also involved comparisons made to the 
Semi-Parallel Tile and Fully-Parallel Tile CAQR algorithms found in [11] which 
are much the same as those found in PLASMA. As with PLASMA, the tuning 
parameter BS controls the domain size upon which a flat tree is used to zero 
out tiles below the root tile within the domain and a binary tree is used to 
merge these domains. Unlike PLASMA, it is not the bottom domain whose size 
decreases as the algorithm progresses through the columns, but instead is the 
top domain. In this study, we found that the PLASMA algorithms performed 
identically or better than these algorithms and therefore we do not report these 
comparisons. 
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Best domain size for PlasmaTree (TT) = [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ] Best domain size for PlASMaTree (TT) = [1 5 5 5 17 28 8] 




(a) Predicted (double complex) (b) Experimental (double complex) 



Best domain size for PlasmaTree (TT) = [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ] Best domain size for PlasmaTree (TT) = [1 3 10 5 17 27 19] 




(c) Predicted (double) (d) Experimental (double) 



Figure 1: Predicted and experimental performance of QR factorization - Triangle on top of triangle kernels 
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Figure 2: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1) 
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(a) Theoretical CP length (b) Experimental (double complex) (c) Experimental (double) 

Figure 3: Detailed view of the overhead in terms of critical path length and time with respect to Greedy (Greedy = 1) 



Figure lb and Id illustrate the experimental performance reached by 
Greedy, Fibonacci and PlasmaTree algorithms using the TT (Triangle on 
top of triangle) kernels. In both cases, double or double complex precision, the 
performance of Greedy is better than PlasmaTree even for the best choice 
of domain size. Moreover, as expected from the analysis in Section 3.2, Greedy 
outperforms Fibonacci the majority of the time. Furthermore, we see that, 
for rectangular matrices, the experimental performance in double complex pre- 
cision matches the prediction. This is not the case for double precision because 
communications have higher impact on performance. 

While it is apparent that Greedy does achieve higher levels of performance, 
the percentage may not be as obvious. To that end, taking Greedy as the 
baseline, we present in Figure 2 the theoretical, double, and double complex 
precision overhead for each algorithm that uses the Triangle on top of triangle 
kernel as compared to Greedy. These overheads are respectively computed in 
terms of critical path length and time. At a smaller scale (Figure 3), it can be 
seen that Greedy can perform up to 13.6% better than PlasmaTree. 

For all matrix sizes considered, p = 40 and 1 < q < 40, in the theoreti- 
cal model, the critical path length for Greedy is either the same as that of 
PlasmaTree (q = 1) or is up to 25% shorter than PlasmaTree (q = 6). 
Analogously, the critical path length for Greedy is at least 2% to 27% shorter 
than that of Fibonacci. In the experiments, the matrix sizes considered were 
p = 40 and q <E {1,2,4,5,10,20,40}. In double precision, Greedy has a de- 
crease of at most 1.5% than the best PlasmaTree (q = 1) and a gain of at 
most 12.8% than the best PlasmaTree (q = 5). In double complex precision, 
Greedy has a decrease of at most 1.5% than the best PlasmaTree (q = 1) 
and a gain of at most 13.6% than the best PlasmaTree (q = 2). Similarly, 
in double precision, Greedy provides a gain of 2.6% to 28.1% over Fibonacci 
and in double complex precision, Greedy has a decrease of at most 2.1% and 
a gain of at most 28.2% over Fibonacci. 

Although it is evidenced that PlasmaTree does not vary too far from 
Greedy or Fibonacci, one must keep in mind that there is a tuning parameter 
involved and we choose the best of these domain sizes for PlasmaTree to 
create the composite result, whereas with Greedy, there is no such parameter 
to consider. Of particular interest is the fact that Greedy always performs 
better than any other algorithm 3 for p S> q. In the scope of PlasmaTree, a 
domain size BS = 1 will force the use of a binary tree so that both Greedy and 
PlasmaTree behave the same. However, as the matrix tends more to a square, 
i.e., q tends toward p, we observe that the performance of all of the algorithms, 
including FlatTree, are on par with Greedy. As more columns are added, 
the parallelism of the algorithm is increased and the critical path becomes less 
of a limiting factor, so that the performance of the kernels is brought to the 
forefront. Therefore, all of the algorithms are performing similarly since they 
all share the same kernels. 

In order to accurately assess the impact of the kernel selection towards the 
performance of the algorithms, Figures 4 and 5 show both the in cache and 
out of cache performance using the No Flush and MultCallFlushLRU strategies 
as presented in [2, 18]. Since an algorithm using TT kernels will need to call 

3 When q = 1, Greedy and FlatTree exhibit close performance. They both perform a 
binary tree reduction, albeit with different row pairings. 
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Figure 4: Kernel performance for double complex precision 
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GEQRT as well as TTQRT to achieve the same as the TS kernel TSQRT, the 
comparison is made between GEQRT + TTQRT and TSQRT (and similarly 
for the updates). For rib = 200, the observed ratio for in cache kernel speed 
for TSQRT to GEQRT + TTQRT is 1.3374, and for TSMQR to UNMQR 
+ TTMQR is 1.3207. For out of cache, the ratio for TSQRT to GEQRT + 
TTQRT is 1.3193 and for TSMQR to UNMQR + TTMQR it is 1.3032. Thus, 
we can expect about a 30% difference between the selection of the kernels, since 
we will have instances of using in cache and out of cache throughout the run. 
Most of this difference is due to the higher efficiency and data locality within 
the TT kernels as compared to the TS kernels. 

Having seen that kernel performance can have a significant impact, we also 
compare the TT based algorithms to those using the TS kernels. The goal is 
to provide a complete assessment of all currently available algorithms, as shown 
in Figure 6. For double precision, the observed difference in kernel speed is 
4.976 GFLOP/sec for the TS kernels versus 3.844 GFLOP/sec for the TT ker- 
nels which provides a ratio of 1.2945 and is in accordance with our previous 
analysis. It can be seen that as the number of columns increases, whereby the 
amount of parallelism increases, the effect of the kernel performance outweighs 
the benefit provided by the extra parallelism afforded through the TT algo- 
rithms. Comparatively, in double complex precision, Greedy docs perform 
better, even against the algorithms using the TS kernels. As before, one must 
keep in mind that Greedy does not require the tuning parameter of the domain 
size to achieve this better performance. 

From these experiments, we showed that in double complex precision, 
Greedy demonstrated better performance than any of the other algorithms 
and moreover, it does so without the need to specify a domain size as opposed 
to the algorithms in PLASMA. In addition, in double precision, for matrices 
where p > ^, Greedy continues to excel over any other algorithm using the 
TT kernels, and continues to do so as the matrices become more square. 

5 Conclusion 

In this manuscript, we have presented Fibonacci, and Greedy, two new al- 
gorithms for tiled QR factorization. These algorithms exhibit more parallelism 
than state-of-the-art implementations based on reduction trees. We have pro- 
vided accurate estimations for the length of their critical path, and we have 
proven that they were asymptotically optimal for a wide class of matrix shapes, 
including all cases where the number of tile rows p and tile columns q are pro- 
portional, p = Xq, A > 1. To the best of our knowledge, this proof is the first 
complexity result in the field of tiled algorithms, and it lays the theoretical 
foundations for a comparative study of these algorithms. 

Comprehensive experiments on multicore platforms confirm the superiority 
of the new algorithms for p x q matrices, as soon as, say, p > 2q. This holds 
true when comparing not only with previous algorithms using TT ( Triangle 
on top of triangle) kernels, but also with all known algorithms based on TS 
( Triangle on top of square) kernels. Given that TS kernels offer more locality, 
and benefit from better elementary arithmetic performance, than TT kernels, 
the better performance of the new algorithms is even more striking, and further 
demonstrates that a large degree of a parallelism was not exploited in previously 
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Figure 6: Predicted and experimental performance of QR factorization - All kernels 
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Figure 8: Detailed view of the overhead in terms of critical path length and time with respect to Greedy (Greedy = 1) 
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Table 5: Greedy versus PlasmaTree (TT) and Fibonacci (Theoretical) 
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Table 6: Greedy versus PlasmaTree (TT) (Experimental Double) 
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Table 7: Greedy versus PlasmaTree (TT) (Experimental Double Complex) 
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Table 8: Greedy versus Fibonacci (Experimental Double) 
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Table 9: Greedy versus Fibonacci (Experimental Double Complex) 
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published solutions. 

Future work will investigate several promising directions. First, using rect- 
angular tiles instead of square tiles could lead to efficient algorithms, with more 
locality and still the same potential for parallelism. Second, refining the model 
to account for communications, and extending it to fully distributed architec- 
tures, would lay the ground to the design of MPI implementations of the new 
algorithms, unleashing their high level of performance on larger platforms. Fi- 
nally, the design of robust algorithms, capable of achieving efficient performance 
despite variations in processor speeds, or even resource failures, is a challenging 
but crucial task to fully benefit from future platforms with a huge number of 
cores. 
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